The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.
Data have been dowloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform initial modeling on the processed weather files.
There are three main processed files available for further exploration:
metar_postEDA_20200617.rds
metar_modifiedClouds_20200617.rds
metar_precipLists_20200617.rds
Glimpses of the three main files are as follows:
# The tidyverse library will be used throughout
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds")
glimpse(metarData)
## Observations: 201,216
## Variables: 41
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_2016", "...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Detroit,...
## $ dtime <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-01-01...
## $ origMETAR <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 RMK AO...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,...
## $ monthint <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ month <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ WindDir <chr> "230", "230", "230", "240", "230", "220", "220", "220...
## $ WindSpeed <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, 15, 1...
## $ WindGust <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, 22, 2...
## $ predomDir <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, S...
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, 10, 7...
## $ Altimeter <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14, 30.1...
## $ TempF <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92, 30.9...
## $ DewF <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94, 19.0...
## $ modSLP <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, 1021....
## $ cType1 <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC...
## $ cType2 <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC", "OV...
## $ cType3 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cType4 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cType5 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cType6 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cLevel1 <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, 2800,...
## $ cLevel2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, 4000,...
## $ cLevel3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cLevel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cLevel5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cLevel6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ isRain <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ isSnow <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ isThunder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ p1Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, NA, 0,...
## $ p36Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA, NA,...
## $ p24Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ tempFHi <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ tempFLo <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ minHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, 2800,...
## $ minType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN, SCT...
## $ ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, 2800,...
## $ ceilingType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN, OVC...
# Extra clouds data
cloudData <- readRDS("./RInputFiles/ProcessedMETAR/metar_modifiedClouds_20200617.rds")
glimpse(cloudData)
## Observations: 393,258
## Variables: 6
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdt...
## $ sourceName <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Detroit, MI...
## $ dtime <dttm> 2016-01-01 00:53:00, 2016-01-01 00:53:00, 2016-01-01 01...
## $ level <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,...
## $ height <dbl> -100, 2000, -100, 2000, -100, 2200, -100, 2500, -100, 27...
## $ type <fct> CLR, OVC, CLR, OVC, CLR, OVC, CLR, OVC, CLR, OVC, CLR, O...
# Precipitation summaries
precipData <- readRDS("./RInputFiles/ProcessedMETAR/metar_precipLists_20200617.rds")
names(precipData)
## [1] "rain2015" "rain2016" "rain2017" "snow2015" "snow2016"
## [6] "snow2017" "thunder2015" "thunder2016" "thunder2017"
names(precipData$rain2016)
## [1] "precipList" "precipLength"
names(precipData$rain2016$precipList)
## [1] "kdtw_2016_RA" "kewr_2016_RA" "kgrb_2016_RA" "kgrr_2016_RA" "kiah_2016_RA"
## [6] "kind_2016_RA" "klas_2016_RA" "klnk_2016_RA" "kmke_2016_RA" "kmsn_2016_RA"
## [11] "kmsp_2016_RA" "kmsy_2016_RA" "kord_2016_RA" "ksan_2016_RA" "ktvc_2016_RA"
names(precipData$rain2016$precipList$kdtw_2016_RA)
## [1] "pStateFrame" "pIssueTimes" "mismatches" "mismatchTimes"
## [5] "beginEndInterval"
glimpse(precipData$rain2016$precipLength)
## Observations: 180
## Variables: 5
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_20...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Detroit, MI (20...
## $ month <fct> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, ...
## $ hours <dbl> 27.928889, 20.828333, 77.274722, 73.072778, 46.488611, 20.12...
## $ events <dbl> 16, 18, 31, 38, 41, 31, 31, 40, 49, 29, 41, 21, 20, 41, 30, ...
glimpse(precipData$rain2016$precipList$kdtw_2016_RA)
## List of 5
## $ pStateFrame :Classes 'tbl_df', 'tbl' and 'data.frame': 8818 obs. of 5 variables:
## ..$ origMETAR: chr [1:8818] "KDTW 310053Z 23007KT 10SM BKN025 OVC050 02/M03 A3017 RMK AO2 SLP223 T00221033" "KDTW 310153Z 22009KT 10SM OVC021 02/M03 A3019 RMK AO2 SLP229 T00221028" "KDTW 310253Z 24009KT 10SM OVC021 02/M03 A3018 RMK AO2 SLP224 T00171028 50004" "KDTW 310353Z 24010KT 10SM OVC025 02/M04 A3018 RMK AO2 SLP225 T00171039" ...
## ..$ dtime : POSIXct[1:8818], format: "2015-12-31 00:53:00" "2015-12-31 01:53:00" ...
## ..$ strPrecip: chr [1:8818] NA NA NA NA ...
## ..$ curPrecip: logi [1:8818] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..$ lagPrecip: logi [1:8818] NA FALSE FALSE FALSE FALSE FALSE ...
## $ pIssueTimes :Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of 1 variable:
## ..$ dtime: POSIXct[1:7], format: "2016-02-24 11:53:00" "2016-04-01 00:53:00" ...
## $ mismatches : int(0)
## $ mismatchTimes : 'POSIXct' num(0)
## - attr(*, "tzone")= chr "UTC"
## $ beginEndInterval:Classes 'tbl_df', 'tbl' and 'data.frame': 387 obs. of 3 variables:
## ..$ beginTimes: POSIXct[1:387], format: "2015-12-31 17:45:00" "2016-01-08 16:27:00" ...
## ..$ endTime : POSIXct[1:387], format: "2015-12-31 17:53:00" "2016-01-09 00:13:00" ...
## ..$ precipInts:Formal class 'Interval' [package "lubridate"] with 3 slots
There are a few functions and mappings that can help with the modeling process:
# The process frequently uses tidyverse and lubridate
library(tidyverse)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"
# Descriptive names for key variables
varMapper <- c(source="Original source file name",
locale="Descriptive name",
dtime="Date-Time (UTC)",
origMETAR="Original METAR",
year="Year",
monthint="Month",
month="Month",
day="Day of Month",
WindDir="Wind Direction (degrees)",
WindSpeed="Wind Speed (kts)",
WindGust="Wind Gust (kts)",
predomDir="General Prevailing Wind Direction",
Visibility="Visibility (SM)",
Altimeter="Altimeter (inches Hg)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
modSLP="Sea-Level Pressure (hPa)",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
isRain="Rain at Observation Time",
isSnow="Snow at Observation Time",
isThunder="Thunder at Obsevation Time",
tempFHi="24-hour High Temperature (F)",
tempFLo="24-hour Low Temperature (F)",
minHeight="Minimum Cloud Height (ft)",
minType="Obscuration Level at Minimum Cloud Height",
ceilingHeight="Minimum Ceiling Height (ft)",
ceilingType="Obscuration Level at Minimum Ceiling Height",
hr="Hour of Day (Zulu time)"
)
# File name to city name mapper
cityNameMapper <- c(kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klnk_2016="Lincoln, NE (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ktvc_2016="Traverse City, MI (2016)"
)
The caret package allows for many types of models to be called on the data.
A function is written to subset the data and create appropriate train-test splits:
# Create test and train data, filtered for locales and months, and keeping only relevant variables
createTestTrain <- function(df,
sources=NULL,
vrbls=NULL,
convFactor="locale",
months=1:12,
testSize=0.3,
noNA=TRUE,
seed=NULL
) {
# FUNCTION ARGUMENTS:
# df: the data frame or tibble containing the data
# sources: the source records to be included (NULL will include all)
# vrbls: the variables to be included (NULL will include all)
# convFactor: variable to convert to a factor after processing (NULL for none)
# months: the months to be included
# testSize: the fraction of observations to be included as test
# seed: the seed for reproducibility
# Set the seed if it has been passed
if (!is.null(seed)) { set.seed(seed) }
# Set sources to be all sources if NULL
if (is.null(sources)) {
sources <- df %>%
count(source) %>%
pull(source)
}
# Set vrbls to be all variables if NULL
if (is.null(vrbls)) { vrbls <- names(df) }
# Filter the data for the relevant sources and months, and limit to vars
dfMod <- df %>%
filter(source %in% sources, monthint %in% months) %>%
select_at(vars(all_of(vrbls)))
# Use only complete cases if noNA=TRUE
if (noNA) {
dfMod <- dfMod %>%
filter(complete.cases(dfMod))
}
# Convert the key variable to a factor if requested
if (!is.null(convFactor)) {
dfMod <- dfMod %>%
mutate_at(vars(all_of(convFactor)), factor)
}
# Split in to test and train (do not sort so that records by locale do not end up in any given order)
idx <- sample(1:nrow(dfMod), round((1-testSize) * nrow(dfMod)), replace=FALSE)
# Return the test and train data
list(testData=dfMod[-idx, ],
trainData=dfMod[idx, ]
)
}
As an example, data for 2016 are pulled for two of the cities - Las Vegas and New Orleans:
ttLists <- createTestTrain(metarData,
sources=c("klas_2016", "kmsy_2016"),
vrbls=c("locale", "month", "predomDir", "TempF", "DewF", "Altimeter", "modSLP"),
seed=2006181356
)
ttLists
## $testData
## # A tibble: 5,246 x 7
## locale month predomDir TempF DewF Altimeter modSLP
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas, NV (2016) Jan NE 37.0 8.06 30.2 1024
## 2 Las Vegas, NV (2016) Jan NE 35.1 8.96 30.2 1024
## 3 Las Vegas, NV (2016) Jan S 30.0 8.96 30.2 1024.
## 4 Las Vegas, NV (2016) Jan S 27.0 8.06 30.2 1025.
## 5 Las Vegas, NV (2016) Jan S 28.9 8.06 30.2 1025.
## 6 Las Vegas, NV (2016) Jan S 37.9 12.0 30.2 1023.
## 7 Las Vegas, NV (2016) Jan 000 39.0 12.9 30.2 1023.
## 8 Las Vegas, NV (2016) Jan NE 42.1 12.9 30.2 1022.
## 9 Las Vegas, NV (2016) Jan 000 41 12.9 30.2 1022.
## 10 Las Vegas, NV (2016) Jan N 51.1 18.0 30.2 1021.
## # ... with 5,236 more rows
##
## $trainData
## # A tibble: 12,242 x 7
## locale month predomDir TempF DewF Altimeter modSLP
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas, NV (2016) Jul SW 102. 21.9 29.7 1004
## 2 New Orleans, LA (2016) Aug SW 78.1 73.9 29.9 1012.
## 3 New Orleans, LA (2016) May E 81.0 72.0 29.9 1012.
## 4 Las Vegas, NV (2016) Dec 000 61.0 36.0 30.0 1017.
## 5 New Orleans, LA (2016) Dec VRB 62.1 57.0 30.1 1019.
## 6 New Orleans, LA (2016) Mar S 72.0 59 30.0 1016
## 7 Las Vegas, NV (2016) Apr S 75.0 12.0 29.8 1008.
## 8 Las Vegas, NV (2016) Nov S 62.1 23 29.9 1011.
## 9 Las Vegas, NV (2016) Dec S 37.0 25.0 30.4 1030.
## 10 New Orleans, LA (2016) Aug E 93.9 75.0 30.0 1016.
## # ... with 12,232 more rows
A simple random forest can then be applied to the data, with several options for complexity:
# Create a tuning grid and run the models
trGrid <- expand.grid(min.node.size=c(1, 5, 10, 25, 100),
mtry=c(1, 2, 3, 4, 5, 6),
splitrule=c("gini")
)
caretModel <- caret::train(locale ~ .,
data=ttLists$trainData,
method="ranger",
tuneGrid=trGrid,
trControl=caret::trainControl(method="cv", number=5),
num.trees=50
)
caretModel
## Random Forest
##
## 12242 samples
## 6 predictor
## 2 classes: 'Las Vegas, NV (2016)', 'New Orleans, LA (2016)'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9793, 9793, 9795, 9793, 9794
## Resampling results across tuning parameters:
##
## min.node.size mtry Accuracy Kappa
## 1 1 0.8427581 0.6855732
## 1 2 0.9449438 0.8898891
## 1 3 0.9732885 0.9465764
## 1 4 0.9825196 0.9650387
## 1 5 0.9842349 0.9684695
## 1 6 0.9843985 0.9687966
## 5 1 0.8451177 0.6903017
## 5 2 0.9421621 0.8843279
## 5 3 0.9742687 0.9485368
## 5 4 0.9810491 0.9620978
## 5 5 0.9830913 0.9661822
## 5 6 0.9844800 0.9689597
## 10 1 0.8492887 0.6986454
## 10 2 0.9402060 0.8804172
## 10 3 0.9716545 0.9433086
## 10 4 0.9803137 0.9606268
## 10 5 0.9829277 0.9658549
## 10 6 0.9839898 0.9679792
## 25 1 0.8367047 0.6734723
## 25 2 0.9436314 0.8872678
## 25 3 0.9703475 0.9406938
## 25 4 0.9774547 0.9549087
## 25 5 0.9802316 0.9604627
## 25 6 0.9812124 0.9624242
## 100 1 0.8462636 0.6925740
## 100 2 0.9366941 0.8733933
## 100 3 0.9594830 0.9189650
## 100 4 0.9684685 0.9369361
## 100 5 0.9698577 0.9397143
## 100 6 0.9718186 0.9436358
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 6, splitrule = gini
## and min.node.size = 5.
The model is not worried about over-fitting, choosing the smallest minimum node size and the largest mrty. How well does this model work on the test data?
# Run the best parameters from ranger in randomForest
caretBest <- randomForest::randomForest(locale ~ .,
data=ttLists$trainData,
ntree=50,
nodesize=1,
mtry=6
)
caretBest
##
## Call:
## randomForest(formula = locale ~ ., data = ttLists$trainData, ntree = 50, nodesize = 1, mtry = 6)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1.47%
## Confusion matrix:
## Las Vegas, NV (2016) New Orleans, LA (2016) class.error
## Las Vegas, NV (2016) 6034 95 0.01550008
## New Orleans, LA (2016) 85 6028 0.01390479
The model estimates an OOB error rate of between 1%-2%, suggesting a strong ability to differentiate Las Vegas from New Orleans.
A function is built to assess performance on the test data:
# Predictions and confusion matrices on test data
evalPredictions <- function(model,
testData,
printAll=TRUE,
printCM=printAll,
printConfSummary=printAll,
printConfTable=printAll,
showPlots=TRUE
) {
# FUNCTION ARGUMENTS:
# model: a trained model
# testData: the test data to apply the model against
# printAll: whether to print the text summaries (by default, carries to the next three arguments)
# printCM: whether to print the confusion matrix
# printConfSummary: whether to print the voting summary (rough proxy for confidence) of the predictions
# printConfTable: whether to summarize the accuracy by voting summary
# showPlots: whether to display plots related to the outputs
# Get the predicted class and probabilities
testClass <- predict(model, newdata=testData)
testProbs <- predict(model, newdata=testData, type="prob")
if (printCM) {
print(caret::confusionMatrix(testClass, testData$locale))
}
# Create a tibble containing class prediction, maximum probability, and individual predictions
tblProbs <- tibble::as_tibble(testProbs) %>%
mutate(maxProb=apply(., 1, FUN=max),
sumProb=apply(., 1, FUN=sum),
predClass=testClass,
locale=testData$locale,
accurate=(predClass==locale)
)
# Describe the maximum probability by source
if (printConfSummary) {
tblProbs %>%
group_by(locale) %>%
summarize(meanMax=mean(maxProb), medianMax=median(maxProb),
pct90Plus=mean(maxProb > 0.9), pct50Minus=mean(maxProb < 0.5)
) %>%
print()
}
# Create a table of accuracy by source and prediction confidence
p1Data <- tblProbs %>%
mutate(predProb=0.5 * round(2*maxProb, 1)) %>%
group_by(predProb, locale) %>%
summarize(pctCorrect=mean(accurate), nCorrect=sum(accurate), nObs=n())
p1Print <- p1Data %>%
group_by(predProb) %>%
summarize(nCorrect=sum(nCorrect), nObs=sum(nObs)) %>%
mutate(pctCorrect=nCorrect/nObs)
if (printConfTable) {
print(p1Print)
}
cat("\nMean Error-Squared Between Confidence of Prediction and Accuracy of Precition\n")
p1Print %>%
mutate(err2=nObs*(pctCorrect-predProb)**2) %>%
summarize(meanError2=sum(err2)/sum(nObs)) %>%
print()
# Plot the maximum probability forecasted by row
p1 <- p1Data %>%
ggplot(aes(x=predProb)) +
geom_col(aes(y=nObs, fill=locale)) +
labs(x="Maximum probability predicted", y="# Observations",
title="Count of Maximum Probability Predicted by Locale"
)
p2 <- p1Data %>%
ggplot(aes(x=predProb)) +
geom_line(aes(y=pctCorrect, group=locale, color=locale)) +
geom_abline(aes(intercept=0, slope=1), lty=2) +
ylim(c(0, 1)) +
labs(x="Maximum probability predicted", y="Actual Probability Correct",
title="Accuracy of Maximum Probability Predicted by Locale"
)
if (showPlots) {
print(p1)
print(p2)
}
tblProbs
}
And the function is then run for the Las Vegas and New Orleans data:
evalPredictions(caretBest, testData=ttLists$testData)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Las Vegas, NV (2016) New Orleans, LA (2016)
## Las Vegas, NV (2016) 2578 27
## New Orleans, LA (2016) 28 2613
##
## Accuracy : 0.9895
## 95% CI : (0.9864, 0.9921)
## No Information Rate : 0.5032
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.979
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9893
## Specificity : 0.9898
## Pos Pred Value : 0.9896
## Neg Pred Value : 0.9894
## Prevalence : 0.4968
## Detection Rate : 0.4914
## Detection Prevalence : 0.4966
## Balanced Accuracy : 0.9895
##
## 'Positive' Class : Las Vegas, NV (2016)
##
## # A tibble: 2 x 5
## locale meanMax medianMax pct90Plus pct50Minus
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas, NV (2016) 0.978 1 0.923 0
## 2 New Orleans, LA (2016) 0.983 1 0.939 0
## # A tibble: 11 x 4
## predProb nCorrect nObs pctCorrect
## <dbl> <int> <int> <dbl>
## 1 0.5 11 20 0.55
## 2 0.55 10 21 0.476
## 3 0.6 14 20 0.7
## 4 0.65 13 18 0.722
## 5 0.7 34 43 0.791
## 6 0.75 29 33 0.879
## 7 0.8 50 52 0.962
## 8 0.85 55 55 1
## 9 0.9 152 157 0.968
## 10 0.95 193 196 0.985
## 11 1 4630 4631 1.00
##
## Mean Error-Squared Between Confidence of Prediction and Accuracy of Precition
## # A tibble: 1 x 1
## meanError2
## <dbl>
## 1 0.000938
## # A tibble: 5,246 x 7
## `Las Vegas, NV (~ `New Orleans, LA~ maxProb sumProb predClass locale accurate
## <dbl> <dbl> <dbl> <dbl> <fct> <fct> <lgl>
## 1 0.98 0.02 0.98 1 Las Vega~ Las V~ TRUE
## 2 1 0 1 1 Las Vega~ Las V~ TRUE
## 3 1 0 1 1 Las Vega~ Las V~ TRUE
## 4 0.98 0.02 0.98 1 Las Vega~ Las V~ TRUE
## 5 1 0 1 1 Las Vega~ Las V~ TRUE
## 6 1 0 1 1 Las Vega~ Las V~ TRUE
## 7 1 0 1 1 Las Vega~ Las V~ TRUE
## 8 1 0 1 1 Las Vega~ Las V~ TRUE
## 9 1 0 1 1 Las Vega~ Las V~ TRUE
## 10 1 0 1 1 Las Vega~ Las V~ TRUE
## # ... with 5,236 more rows
caret::varImp(caretBest) %>%
rownames_to_column() %>%
ggplot(aes(x=fct_reorder(rowname, -Overall), y=Overall)) +
geom_col() +
labs(x="", y="Importance")
The model has greater than 98% accuracy in splitting Las Vegas and New Orleans, driven by 1) New Orleans being extremely humid and Las Vegas being a desert, and 2) New Orleans being at sea level and Las Vegas being at high altitude.
Using modSLP and Altimeter is arguably cheating since the values and relationships between them are highly driven by a location’s height relative to sea-level. How does the model perform if these are deleted?
# Create a tuning grid and run the models
trGrid <- expand.grid(min.node.size=c(1, 5, 10, 25, 100),
mtry=c(1, 2, 3, 4),
splitrule=c("gini")
)
caretModel <- caret::train(locale ~ DewF + month + predomDir + TempF,
data=ttLists$trainData,
method="ranger",
tuneGrid=trGrid,
trControl=caret::trainControl(method="cv", number=5),
num.trees=50
)
caretModel
## Random Forest
##
## 12242 samples
## 4 predictor
## 2 classes: 'Las Vegas, NV (2016)', 'New Orleans, LA (2016)'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9795, 9793, 9793, 9794, 9793
## Resampling results across tuning parameters:
##
## min.node.size mtry Accuracy Kappa
## 1 1 0.7760118 0.5520364
## 1 2 0.9124317 0.8248627
## 1 3 0.9327730 0.8655459
## 1 4 0.9500081 0.9000133
## 5 1 0.7602487 0.5204263
## 5 2 0.8996882 0.7993784
## 5 3 0.9323644 0.8647286
## 5 4 0.9480482 0.8960938
## 10 1 0.7947936 0.5895973
## 10 2 0.9154553 0.8309076
## 10 3 0.9313021 0.8626014
## 10 4 0.9465769 0.8931516
## 25 1 0.8033830 0.6068192
## 25 2 0.8972383 0.7944852
## 25 3 0.9300767 0.8601515
## 25 4 0.9462493 0.8924960
## 100 1 0.8175975 0.6352329
## 100 2 0.8979737 0.7959561
## 100 3 0.9250123 0.8500224
## 100 4 0.9345694 0.8691359
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
## and min.node.size = 1.
The model can then be run using the best parameters:
# Run the best parameters from ranger in randomForest
caretBest <- randomForest::randomForest(locale ~ DewF + month + predomDir + TempF,
data=ttLists$trainData,
ntree=50,
nodesize=1,
mtry=4
)
caretBest
##
## Call:
## randomForest(formula = locale ~ DewF + month + predomDir + TempF, data = ttLists$trainData, ntree = 50, nodesize = 1, mtry = 4)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 3.65%
## Confusion matrix:
## Las Vegas, NV (2016) New Orleans, LA (2016) class.error
## Las Vegas, NV (2016) 5912 217 0.03540545
## New Orleans, LA (2016) 230 5883 0.03762473
As expected, performance dips to around 95% accuracy:
evalPredictions(caretBest, testData=ttLists$testData)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Las Vegas, NV (2016) New Orleans, LA (2016)
## Las Vegas, NV (2016) 2514 89
## New Orleans, LA (2016) 92 2551
##
## Accuracy : 0.9655
## 95% CI : (0.9602, 0.9703)
## No Information Rate : 0.5032
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.931
##
## Mcnemar's Test P-Value : 0.8818
##
## Sensitivity : 0.9647
## Specificity : 0.9663
## Pos Pred Value : 0.9658
## Neg Pred Value : 0.9652
## Prevalence : 0.4968
## Detection Rate : 0.4792
## Detection Prevalence : 0.4962
## Balanced Accuracy : 0.9655
##
## 'Positive' Class : Las Vegas, NV (2016)
##
## # A tibble: 2 x 5
## locale meanMax medianMax pct90Plus pct50Minus
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas, NV (2016) 0.962 1 0.868 0
## 2 New Orleans, LA (2016) 0.971 1 0.900 0
## # A tibble: 11 x 4
## predProb nCorrect nObs pctCorrect
## <dbl> <int> <int> <dbl>
## 1 0.5 8 27 0.296
## 2 0.55 12 22 0.545
## 3 0.6 37 64 0.578
## 4 0.65 38 59 0.644
## 5 0.7 62 79 0.785
## 6 0.75 31 50 0.62
## 7 0.8 85 104 0.817
## 8 0.85 80 93 0.860
## 9 0.9 182 194 0.938
## 10 0.95 221 227 0.974
## 11 1 4309 4327 0.996
##
## Mean Error-Squared Between Confidence of Prediction and Accuracy of Precition
## # A tibble: 1 x 1
## meanError2
## <dbl>
## 1 0.000589
## # A tibble: 5,246 x 7
## `Las Vegas, NV (~ `New Orleans, LA~ maxProb sumProb predClass locale accurate
## <dbl> <dbl> <dbl> <dbl> <fct> <fct> <lgl>
## 1 0.96 0.04 0.96 1 Las Vega~ Las V~ TRUE
## 2 0.96 0.04 0.96 1 Las Vega~ Las V~ TRUE
## 3 1 0 1 1 Las Vega~ Las V~ TRUE
## 4 1 0 1 1 Las Vega~ Las V~ TRUE
## 5 1 0 1 1 Las Vega~ Las V~ TRUE
## 6 1 0 1 1 Las Vega~ Las V~ TRUE
## 7 1 0 1 1 Las Vega~ Las V~ TRUE
## 8 1 0 1 1 Las Vega~ Las V~ TRUE
## 9 1 0 1 1 Las Vega~ Las V~ TRUE
## 10 1 0 1 1 Las Vega~ Las V~ TRUE
## # ... with 5,236 more rows
caret::varImp(caretBest) %>%
rownames_to_column() %>%
ggplot(aes(x=fct_reorder(rowname, -Overall), y=Overall)) +
geom_col() +
labs(x="", y="Importance")
The very large difference in dew point drives the high classification ability:
ttLists$trainData %>%
bind_rows(ttLists$testData) %>%
ggplot(aes(x=TempF, y=DewF)) +
geom_bin2d() +
facet_wrap(~locale) +
geom_density_2d() +
scale_fill_continuous(low="white", high="black")
Suppose on the other hand that the model tries to distinguish New Orleans from Houston:
ttLists <- createTestTrain(metarData,
sources=c("kiah_2016", "kmsy_2016"),
vrbls=c("locale", "month", "predomDir", "TempF", "DewF"),
seed=2006181440
)
ttLists
## $testData
## # A tibble: 5,248 x 5
## locale month predomDir TempF DewF
## <fct> <fct> <fct> <dbl> <dbl>
## 1 Houston, TX (2016) Jan N 53.1 39.9
## 2 Houston, TX (2016) Jan N 52.0 39.9
## 3 Houston, TX (2016) Jan N 48.0 42.1
## 4 Houston, TX (2016) Jan N 48.0 35.1
## 5 Houston, TX (2016) Jan N 48.0 35.1
## 6 Houston, TX (2016) Jan N 48.9 35.1
## 7 Houston, TX (2016) Jan N 48.0 35.1
## 8 Houston, TX (2016) Jan N 48.0 37.0
## 9 Houston, TX (2016) Jan N 46.9 39.0
## 10 Houston, TX (2016) Jan N 45.0 37.0
## # ... with 5,238 more rows
##
## $trainData
## # A tibble: 12,245 x 5
## locale month predomDir TempF DewF
## <fct> <fct> <fct> <dbl> <dbl>
## 1 New Orleans, LA (2016) Dec N 53.1 37.9
## 2 Houston, TX (2016) Oct S 75.9 71.1
## 3 Houston, TX (2016) Oct NE 73.0 44.1
## 4 New Orleans, LA (2016) May N 72.0 55.9
## 5 New Orleans, LA (2016) Aug S 78.1 72.0
## 6 Houston, TX (2016) Jun 000 78.1 75.0
## 7 Houston, TX (2016) Mar 000 54.0 53.1
## 8 Houston, TX (2016) Oct N 81.0 46.0
## 9 New Orleans, LA (2016) Dec SW 80.1 64.9
## 10 Houston, TX (2016) Sep S 84.0 78.1
## # ... with 12,235 more rows
# Create a tuning grid and run the models
trGrid <- expand.grid(min.node.size=c(1, 5, 10, 25, 100),
mtry=c(1, 2, 3, 4),
splitrule=c("gini")
)
caretModel <- caret::train(locale ~ DewF + month + predomDir + TempF,
data=ttLists$trainData,
method="ranger",
tuneGrid=trGrid,
trControl=caret::trainControl(method="cv", number=5),
num.trees=50
)
caretModel
## Random Forest
##
## 12245 samples
## 4 predictor
## 2 classes: 'Houston, TX (2016)', 'New Orleans, LA (2016)'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9796, 9796, 9796, 9796, 9796
## Resampling results across tuning parameters:
##
## min.node.size mtry Accuracy Kappa
## 1 1 0.5741119 0.1467460
## 1 2 0.6125766 0.2248296
## 1 3 0.6340547 0.2680327
## 1 4 0.6553695 0.3108463
## 5 1 0.5731319 0.1444556
## 5 2 0.6140465 0.2278396
## 5 3 0.6317681 0.2635971
## 5 4 0.6561045 0.3122708
## 10 1 0.5743569 0.1471436
## 10 2 0.6134749 0.2265743
## 10 3 0.6307064 0.2614035
## 10 4 0.6551245 0.3103053
## 25 1 0.5802368 0.1595270
## 25 2 0.6185382 0.2367579
## 25 3 0.6270314 0.2539625
## 25 4 0.6541445 0.3082967
## 100 1 0.5728052 0.1440925
## 100 2 0.6066966 0.2129075
## 100 3 0.6273581 0.2546343
## 100 4 0.6435280 0.2871188
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
## and min.node.size = 5.
The best parameters from the model can then be run:
# Run the best parameters from ranger in randomForest
caretBest <- randomForest::randomForest(locale ~ DewF + month + predomDir + TempF,
data=ttLists$trainData,
ntree=50,
nodesize=10,
mtry=4
)
caretBest
##
## Call:
## randomForest(formula = locale ~ DewF + month + predomDir + TempF, data = ttLists$trainData, ntree = 50, nodesize = 10, mtry = 4)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 27.48%
## Confusion matrix:
## Houston, TX (2016) New Orleans, LA (2016) class.error
## Houston, TX (2016) 4269 1880 0.3057408
## New Orleans, LA (2016) 1485 4611 0.2436024
Accuracy dips to the 70% range, though this is still surprisingly high given how similar the climates are in New Orleans and Houston:
evalPredictions(caretBest, testData=ttLists$testData)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Houston, TX (2016) New Orleans, LA (2016)
## Houston, TX (2016) 1789 587
## New Orleans, LA (2016) 802 2070
##
## Accuracy : 0.7353
## 95% CI : (0.7232, 0.7472)
## No Information Rate : 0.5063
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.47
##
## Mcnemar's Test P-Value : 9.357e-09
##
## Sensitivity : 0.6905
## Specificity : 0.7791
## Pos Pred Value : 0.7529
## Neg Pred Value : 0.7208
## Prevalence : 0.4937
## Detection Rate : 0.3409
## Detection Prevalence : 0.4527
## Balanced Accuracy : 0.7348
##
## 'Positive' Class : Houston, TX (2016)
##
## # A tibble: 2 x 5
## locale meanMax medianMax pct90Plus pct50Minus
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Houston, TX (2016) 0.784 0.78 0.293 0
## 2 New Orleans, LA (2016) 0.785 0.8 0.275 0
## # A tibble: 11 x 4
## predProb nCorrect nObs pctCorrect
## <dbl> <int> <int> <dbl>
## 1 0.5 119 229 0.520
## 2 0.55 198 352 0.562
## 3 0.6 308 535 0.576
## 4 0.65 228 370 0.616
## 5 0.7 335 504 0.665
## 6 0.75 297 409 0.726
## 7 0.8 432 582 0.742
## 8 0.85 324 404 0.802
## 9 0.9 457 564 0.810
## 10 0.95 431 499 0.864
## 11 1 730 800 0.912
##
## Mean Error-Squared Between Confidence of Prediction and Accuracy of Precition
## # A tibble: 1 x 1
## meanError2
## <dbl>
## 1 0.00362
## # A tibble: 5,248 x 7
## `Houston, TX (20~ `New Orleans, LA~ maxProb sumProb predClass locale accurate
## <dbl> <dbl> <dbl> <dbl> <fct> <fct> <lgl>
## 1 0.4 0.6 0.6 1 New Orle~ Houst~ FALSE
## 2 0.6 0.4 0.6 1 Houston,~ Houst~ TRUE
## 3 0.2 0.8 0.8 1 New Orle~ Houst~ FALSE
## 4 0 1 1 1 New Orle~ Houst~ FALSE
## 5 0 1 1 1 New Orle~ Houst~ FALSE
## 6 0.06 0.94 0.94 1 New Orle~ Houst~ FALSE
## 7 0 1 1 1 New Orle~ Houst~ FALSE
## 8 0 1 1 1 New Orle~ Houst~ FALSE
## 9 0.64 0.36 0.64 1 Houston,~ Houst~ TRUE
## 10 1 0 1 1 Houston,~ Houst~ TRUE
## # ... with 5,238 more rows
caret::varImp(caretBest) %>%
rownames_to_column() %>%
ggplot(aes(x=fct_reorder(rowname, -Overall), y=Overall)) +
geom_col() +
labs(x="", y="Importance")
The mix of temperature and dew-point is the primary driver of the classiciations. There are many more “low confidence” (close voting) predictions for these two cities:
ttLists$trainData %>%
bind_rows(ttLists$testData) %>%
ggplot(aes(x=TempF, y=DewF)) +
geom_bin2d() +
facet_wrap(~locale) +
geom_density_2d() +
scale_fill_continuous(low="white", high="black")
It is impressive that the model can tease out distinctions in data that are, at a glance, very similar.
Suppose that the only information available about two cities were their temperatures and dew points. How well would a basic random forest, with mtry=2, classify the cities?
A function is written to take two locales and a random seed, create relevant test-train data, run a random forest model, make predictions, assess the accuracy, and return the relevant objects:
# The createTestTrain function is updated to purely split an input dataframe
createTestTrain <- function(df,
testSize=0.3,
sortTrain=FALSE,
noNA=TRUE,
seed=NULL
) {
# FUNCTION ARGUMENTS:
# df: the data frame or tibble for analysis
# testSize: the proportion of the data to be used as test
# sortTrain: boolean, whether to sort training indices to maintain the original order of training data
# noNA: boolean, whether to include only complete cases
# seed: the random seed to be used (NULL means no seed)
# Filter out NA if requested
if (noNA) {
df <- df %>%
filter(complete.cases(df))
}
# Get the desired number of train objects
nTrain <- round((1-testSize) * nrow(df))
# Set the random seed if it has been passed
if (!is.null(seed)) { set.seed(seed) }
# Get the indices for the training data
idxTrain <- sample(1:nrow(df), size=nTrain, replace=FALSE)
# Sort if requested
if (sortTrain) { idxTrain <- sort(idxTrain) }
# Return a list containing the train and test data
list(trainData=df[idxTrain, ],
testData=df[-idxTrain, ]
)
}
# Run a random forest for two locales
rfTwoLocales <- function(df,
loc1,
loc2,
locVar="source",
otherVar="dtime",
vrbls=c("TempF", "DewF"),
pred="locale",
seed=NULL,
ntree=100,
mtry=NULL,
testSize=0.3
) {
# FUNCTION ARGUMENTS:
# df: the data frame or tibble
# loc1: the first locale
# loc2: the second locale
# locVar: the name of the variable where loc1 and loc2 can be found
# otherVar: other variables to be kept, but not used in modeling
# vrbls: explanatory variables for modeling
# pred: predictor variable for modeling
# seed: the random seed (NULL means no seed)
# ntree: the number of trees to grow in the random forest
# mtry: the splitting parameter for the random forest (NULL means use all variables)
# testSize: proportion of data to use as test dataset
# Filter df such that it includes only observations in loc1 or loc2
# Select only the predictor and variables of interest
dfModel <- df %>%
filter(get(locVar) %in% c(loc1, loc2)) %>%
select_at(vars(all_of(c(pred, otherVar, vrbls))))
# Create the test-train split (will randomize using seed if provided)
ttLists <- createTestTrain(dfModel, testSize=testSize, seed=seed)
# Set the seed if requested
if (!is.null(seed)) { set.seed(seed) }
# Set mtry to be the length of the variables if not provided
if (is.null(mtry)) { mtry <- length(vrbls) }
# Run the random forest on the training data
rfModel <- randomForest::randomForest(x=ttLists$trainData[, vrbls],
y=factor(ttLists$trainData[, pred, drop=TRUE]),
mtry=mtry,
ntree=ntree
)
# Create predictions on the test data
testPred <- predict(rfModel, ttLists$testData)
# Augment the test data with the predictions
testData <- ttLists$testData %>%
mutate(predicted=testPred, correct=(testPred==get(pred)))
# Grab the accuracy and accuracy by class
# Return the objects as a list
list(rfModel=rfModel,
testData=testData,
errorRate=colMeans(rfModel$err.rate)
)
}
The function is then run for every combination of locales from 2016 in cityNameMapper. A common random seed is applied to every run of the process:
# Create the file names to explore
names_2016 <- grep(x=names(cityNameMapper), pattern="_2016", value=TRUE)
# Create a container list to hold the output
list_2016_TempF_DewF <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_2016_TempF_DewF[[n]] <- rfTwoLocales(metarData,
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF"),
ntree=25
)
n <- n + 1
}
}
Statistics about the overall accuracy can then be captured:
# Helper function for map_dfr
helperAccuracyLocale <- function(x) {
y <- x$errorRate
tibble::tibble(locale1=names(y)[2],
locale2=names(y)[3],
accOverall=1-y[1],
accLocale1=1-y[2],
accLocale2=1-y[3]
)
}
# Create a tibble from the underlying accuracy data
acc_TempF_DewF <- map_dfr(list_2016_TempF_DewF, .f=helperAccuracyLocale)
# Assess the top 10 accuracy and the bottom 10 accuracy
acc_TempF_DewF %>%
arrange(-accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Green Bay, WI (2016) Las Vegas, NV (2016) 0.897 0.893 0.901
## 2 Las Vegas, NV (2016) New Orleans, LA (201~ 0.890 0.903 0.877
## 3 Las Vegas, NV (2016) Milwaukee, WI (2016) 0.888 0.887 0.889
## 4 Houston, TX (2016) Las Vegas, NV (2016) 0.884 0.871 0.897
## 5 Las Vegas, NV (2016) Madison, WI (2016) 0.879 0.892 0.866
## 6 Indianapolis, IN (201~ Las Vegas, NV (2016) 0.875 0.893 0.859
## 7 Chicago, IL (2016) Las Vegas, NV (2016) 0.874 0.893 0.854
## 8 Grand Rapids, MI (201~ Las Vegas, NV (2016) 0.872 0.888 0.857
## 9 Las Vegas, NV (2016) San Diego, CA (2016) 0.867 0.872 0.863
## 10 Las Vegas, NV (2016) Traverse City, MI (2~ 0.865 0.863 0.868
acc_TempF_DewF %>%
arrange(accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chicago, IL (2016) Milwaukee, WI (2016) 0.539 0.574 0.505
## 2 Chicago, IL (2016) Madison, WI (2016) 0.548 0.580 0.515
## 3 Detroit, MI (2016) Grand Rapids, MI (20~ 0.550 0.491 0.610
## 4 Detroit, MI (2016) Lincoln, NE (2016) 0.551 0.589 0.513
## 5 Chicago, IL (2016) Detroit, MI (2016) 0.552 0.522 0.581
## 6 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.553 0.595 0.511
## 7 Indianapolis, IN (201~ Lincoln, NE (2016) 0.554 0.595 0.513
## 8 Detroit, MI (2016) Traverse City, MI (2~ 0.558 0.496 0.619
## 9 Detroit, MI (2016) Indianapolis, IN (20~ 0.559 0.587 0.531
## 10 Green Bay, WI (2016) Madison, WI (2016) 0.560 0.528 0.591
Not surprisingly, accuracy is highest when comparing cold-weather cities or hot-humid cities to Las Vegas. Accuracy is lowest when comparing cold-weather cities against each other. Interestingly, Chicago-Milwaukee and Grand Rapids-Traverse City have again emerged as the two “closest” cities in this analysis.
Of further interest is an overall accuracy by city:
allAccuracy <- select(acc_TempF_DewF, locale=locale1, other=locale2, accOverall, accLocale=accLocale1) %>%
bind_rows(select(acc_TempF_DewF, locale=locale2, other=locale1, accOverall, accLocale=accLocale2))
# Find the best match by locale
allAccuracy %>%
group_by(locale) %>%
top_n(accOverall, n=1) %>%
arrange(-accOverall)
## # A tibble: 15 x 4
## # Groups: locale [15]
## locale other accOverall accLocale
## <chr> <chr> <dbl> <dbl>
## 1 Green Bay, WI (2016) Las Vegas, NV (2016) 0.897 0.893
## 2 Las Vegas, NV (2016) Green Bay, WI (2016) 0.897 0.901
## 3 New Orleans, LA (2016) Las Vegas, NV (2016) 0.890 0.877
## 4 Milwaukee, WI (2016) Las Vegas, NV (2016) 0.888 0.889
## 5 Houston, TX (2016) Las Vegas, NV (2016) 0.884 0.871
## 6 Madison, WI (2016) Las Vegas, NV (2016) 0.879 0.866
## 7 Indianapolis, IN (2016) Las Vegas, NV (2016) 0.875 0.893
## 8 Chicago, IL (2016) Las Vegas, NV (2016) 0.874 0.893
## 9 Grand Rapids, MI (2016) Las Vegas, NV (2016) 0.872 0.888
## 10 San Diego, CA (2016) Las Vegas, NV (2016) 0.867 0.863
## 11 Traverse City, MI (2016) Las Vegas, NV (2016) 0.865 0.868
## 12 Detroit, MI (2016) Las Vegas, NV (2016) 0.864 0.885
## 13 Minneapolis, MN (2016) Las Vegas, NV (2016) 0.856 0.856
## 14 Lincoln, NE (2016) Las Vegas, NV (2016) 0.845 0.833
## 15 Newark, NJ (2016) Las Vegas, NV (2016) 0.809 0.812
# Find the worst match by locale
allAccuracy %>%
group_by(locale) %>%
top_n(-accOverall, n=1) %>%
arrange(accOverall)
## # A tibble: 15 x 4
## # Groups: locale [15]
## locale other accOverall accLocale
## <chr> <chr> <dbl> <dbl>
## 1 Chicago, IL (2016) Milwaukee, WI (2016) 0.539 0.574
## 2 Milwaukee, WI (2016) Chicago, IL (2016) 0.539 0.505
## 3 Madison, WI (2016) Chicago, IL (2016) 0.548 0.515
## 4 Detroit, MI (2016) Grand Rapids, MI (2016) 0.550 0.491
## 5 Grand Rapids, MI (2016) Detroit, MI (2016) 0.550 0.610
## 6 Lincoln, NE (2016) Detroit, MI (2016) 0.551 0.513
## 7 Traverse City, MI (2016) Grand Rapids, MI (2016) 0.553 0.511
## 8 Indianapolis, IN (2016) Lincoln, NE (2016) 0.554 0.595
## 9 Green Bay, WI (2016) Madison, WI (2016) 0.560 0.528
## 10 Minneapolis, MN (2016) Madison, WI (2016) 0.568 0.555
## 11 Newark, NJ (2016) Lincoln, NE (2016) 0.590 0.571
## 12 Houston, TX (2016) New Orleans, LA (2016) 0.605 0.542
## 13 New Orleans, LA (2016) Houston, TX (2016) 0.605 0.667
## 14 San Diego, CA (2016) Minneapolis, MN (2016) 0.764 0.834
## 15 Las Vegas, NV (2016) Newark, NJ (2016) 0.809 0.806
# Overall accuracy by location plot
allAccuracy %>%
group_by(locale) %>%
summarize_if(is.numeric, mean) %>%
ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
# Overall accuracy heatmap
allAccuracy %>%
ggplot(aes(x=locale, y=other)) +
geom_tile(aes(fill=accOverall)) +
theme(axis.text.x=element_text(angle=90)) +
scale_fill_continuous("Accuracy", high="darkblue", low="white") +
labs(title="Accuracy Predicting Locale vs. Locale",
caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)",
x="",
y=""
)
One hypothesis is that adding month as a predictor may help for distinguishing cities in different climates. For example, summer in the warm season in Detroit may look like spring/fall in Houston or New Orleans:
# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData,
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF", "month"),
ntree=25
)
n <- n + 1
}
}
Accuracy can then be assessed:
# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)
# Assess the top 10 accuracy and the bottom 10 accuracy
acc_TempF_DewF_month %>%
arrange(-accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Green Bay, WI (2016) Las Vegas, NV (2016) 0.961 0.960 0.961
## 2 Las Vegas, NV (2016) Madison, WI (2016) 0.949 0.952 0.945
## 3 Las Vegas, NV (2016) Milwaukee, WI (2016) 0.944 0.947 0.941
## 4 New Orleans, LA (2016) Traverse City, MI (2~ 0.941 0.946 0.937
## 5 Las Vegas, NV (2016) Traverse City, MI (2~ 0.940 0.940 0.939
## 6 Minneapolis, MN (2016) New Orleans, LA (201~ 0.939 0.941 0.938
## 7 Green Bay, WI (2016) New Orleans, LA (201~ 0.938 0.931 0.944
## 8 Grand Rapids, MI (201~ Las Vegas, NV (2016) 0.937 0.939 0.934
## 9 Houston, TX (2016) Traverse City, MI (2~ 0.935 0.938 0.931
## 10 Las Vegas, NV (2016) Minneapolis, MN (201~ 0.935 0.938 0.931
acc_TempF_DewF_month %>%
arrange(accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chicago, IL (2016) Milwaukee, WI (2016) 0.578 0.619 0.536
## 2 Detroit, MI (2016) Grand Rapids, MI (20~ 0.584 0.605 0.564
## 3 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.598 0.623 0.574
## 4 Chicago, IL (2016) Madison, WI (2016) 0.600 0.630 0.569
## 5 Green Bay, WI (2016) Madison, WI (2016) 0.602 0.631 0.573
## 6 Chicago, IL (2016) Grand Rapids, MI (20~ 0.606 0.580 0.631
## 7 Grand Rapids, MI (201~ Madison, WI (2016) 0.608 0.590 0.626
## 8 Madison, WI (2016) Milwaukee, WI (2016) 0.610 0.641 0.580
## 9 Chicago, IL (2016) Detroit, MI (2016) 0.612 0.666 0.558
## 10 Chicago, IL (2016) Indianapolis, IN (20~ 0.613 0.637 0.590
allAccuracy_month <- select(acc_TempF_DewF_month,
locale=locale1,
other=locale2,
accOverall,
accLocale=accLocale1
) %>%
bind_rows(select(acc_TempF_DewF_month,
locale=locale2,
other=locale1,
accOverall,
accLocale=accLocale2
)
)
# Find the best match by locale
allAccuracy_month %>%
group_by(locale) %>%
top_n(accOverall, n=1) %>%
arrange(-accOverall)
## # A tibble: 15 x 4
## # Groups: locale [15]
## locale other accOverall accLocale
## <chr> <chr> <dbl> <dbl>
## 1 Green Bay, WI (2016) Las Vegas, NV (2016) 0.961 0.960
## 2 Las Vegas, NV (2016) Green Bay, WI (2016) 0.961 0.961
## 3 Madison, WI (2016) Las Vegas, NV (2016) 0.949 0.945
## 4 Milwaukee, WI (2016) Las Vegas, NV (2016) 0.944 0.941
## 5 New Orleans, LA (2016) Traverse City, MI (2016) 0.941 0.946
## 6 Traverse City, MI (2016) New Orleans, LA (2016) 0.941 0.937
## 7 Minneapolis, MN (2016) New Orleans, LA (2016) 0.939 0.941
## 8 Grand Rapids, MI (2016) Las Vegas, NV (2016) 0.937 0.939
## 9 Houston, TX (2016) Traverse City, MI (2016) 0.935 0.938
## 10 Chicago, IL (2016) Las Vegas, NV (2016) 0.933 0.933
## 11 Detroit, MI (2016) Las Vegas, NV (2016) 0.928 0.928
## 12 Indianapolis, IN (2016) Las Vegas, NV (2016) 0.920 0.927
## 13 San Diego, CA (2016) Las Vegas, NV (2016) 0.910 0.900
## 14 Lincoln, NE (2016) Las Vegas, NV (2016) 0.906 0.907
## 15 Newark, NJ (2016) New Orleans, LA (2016) 0.873 0.858
# Find the worst match by locale
allAccuracy_month %>%
group_by(locale) %>%
top_n(-accOverall, n=1) %>%
arrange(accOverall)
## # A tibble: 15 x 4
## # Groups: locale [15]
## locale other accOverall accLocale
## <chr> <chr> <dbl> <dbl>
## 1 Chicago, IL (2016) Milwaukee, WI (2016) 0.578 0.619
## 2 Milwaukee, WI (2016) Chicago, IL (2016) 0.578 0.536
## 3 Detroit, MI (2016) Grand Rapids, MI (2016) 0.584 0.605
## 4 Grand Rapids, MI (2016) Detroit, MI (2016) 0.584 0.564
## 5 Traverse City, MI (2016) Grand Rapids, MI (2016) 0.598 0.574
## 6 Madison, WI (2016) Chicago, IL (2016) 0.600 0.569
## 7 Green Bay, WI (2016) Madison, WI (2016) 0.602 0.631
## 8 Indianapolis, IN (2016) Chicago, IL (2016) 0.613 0.590
## 9 Lincoln, NE (2016) Indianapolis, IN (2016) 0.618 0.559
## 10 Minneapolis, MN (2016) Madison, WI (2016) 0.626 0.611
## 11 Houston, TX (2016) New Orleans, LA (2016) 0.640 0.643
## 12 New Orleans, LA (2016) Houston, TX (2016) 0.640 0.637
## 13 Newark, NJ (2016) Indianapolis, IN (2016) 0.662 0.678
## 14 San Diego, CA (2016) Indianapolis, IN (2016) 0.846 0.857
## 15 Las Vegas, NV (2016) Newark, NJ (2016) 0.872 0.872
# Overall accuracy by location plot
allAccuracy_month %>%
group_by(locale) %>%
summarize_if(is.numeric, mean) %>%
ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
# Overall accuracy heatmap
allAccuracy_month %>%
ggplot(aes(x=locale, y=other)) +
geom_tile(aes(fill=accOverall)) +
theme(axis.text.x=element_text(angle=90)) +
scale_fill_continuous("Accuracy", high="darkblue", low="white") +
labs(title="Accuracy Predicting Locale vs. Locale",
caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)",
x="",
y=""
)
Accuracies are meaningfully higher. Of interest, there is a new group of highly differentiated locales, including New Orleans/Houston vs. Traverse City/Minneapolis. As hypothesized, adding month significantly aids in differentiating wintry cities from hot-humid cities, as summer in a wintry city may otherwise be hard to distinguish from spring/fall in a hot-humid city.
In fact, with the exception of Houston vs. New Orleans, there appears to be very good differentiation for each of Las Vegas, San Diego, Hoston, and New Orleans from all other locales.
Change in accuracy can also be plotted:
# Pull the accuracies for the two different models
x1 <- allAccuracy %>%
group_by(locale) %>%
summarize(acc1=mean(accOverall))
x2 <- allAccuracy_month %>%
group_by(locale) %>%
summarize(acc2=mean(accOverall))
# Merge accuracy data and plot
x1 %>%
inner_join(x2, by="locale") %>%
ggplot(aes(x=fct_reorder(locale, acc2))) +
geom_point(aes(y=acc2), size=2) +
geom_point(aes(y=acc1), size=1) +
geom_text(aes(y=acc2+0.02, label=paste0(round(100*acc2), "%"))) +
geom_text(aes(y=acc1-0.02, label=paste0(round(100*acc1), "%")), size=3) +
geom_segment(aes(xend=fct_reorder(locale, acc2), y=acc1, yend=acc2-0.005),
arrow=arrow(length=unit(0.3, "cm"), type="closed")
) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature, Dew Point, month as predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
The increase in accuracy is particularly striking for New Orleans and Houston.
Next, the simple model is run to classify locale across the full dataset. The rfTwoLocales() function can be called in a slightly modified manner for this purpose:
# Run random forest for multiple locales
rfMultiLocale <- function(tbl,
vrbls,
locs=NULL,
locVar="source",
otherVar="dtime",
pred="locale",
seed=NULL,
ntree=100,
mtry=NULL,
testSize=0.3
) {
# FUNCTION ARGUMENTS:
# df: the data frame or tibble
# vrbls: explanatory variables for modeling
# locs: the locations to use (NULL means all)
# locVar: the name of the variable where locs can be found
# otherVar: other variables to be kept, but not used in modeling
# pred: predictor variable for modeling
# seed: the random seed (NULL means no seed)
# ntree: the number of trees to grow in the random forest
# mtry: the splitting parameter for the random forest (NULL means use all variables)
# testSize: the fractional portion of data that should be used as the test dataset
# Create locs if it has not been passed
if (is.null(locs)) {
locs <- tbl %>% pull(locVar) %>% unique() %>% sort()
cat("\nRunning for locations:\n")
print(locs)
}
# Pass to rfTwoLocales
rfOut <- rfTwoLocales(tbl,
loc1=locs,
loc2=c(),
locVar=locVar,
otherVar=otherVar,
vrbls=vrbls,
pred=pred,
seed=seed,
ntree=ntree,
mtry=mtry,
testSize=testSize
)
# Return the list object
rfOut
}
The function can then be applied to the 2016 data:
# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month"),
locs=names_2016,
ntree=50,
seed=2006201341
)
Summaries can then be created for the accuracy in predicting each locale:
# Create summary of accuracy
all2016Accuracy <- rf_all_2016_TDm$testData %>%
mutate(locale=factor(locale, levels=levels(predicted))) %>%
count(locale, predicted, correct) %>%
group_by(locale) %>%
mutate(pct=n/sum(n)) %>%
ungroup()
# Calculate the number of levels
nLevels <- length(levels(factor(all2016Accuracy$locale)))
nullAcc <- 1 / nLevels
# Create plot for overall accuracy
all2016Accuracy %>%
filter(locale==predicted) %>%
ggplot(aes(x=fct_reorder(locale, pct))) +
geom_point(aes(y=pct), size=2) +
geom_text(aes(y=pct+0.04, label=paste0(round(100*pct), "%"))) +
geom_hline(aes(yintercept=nullAcc), lty=2) +
coord_flip() +
ylim(0, 1) +
labs(x="",
y="Correctly Predicted",
title="Accuracy of Locale Predictions",
subtitle="(positive detection rate by locale)",
caption=paste0("Temperature, Dew Point, Month as predictors\n(",
round(100*nullAcc),
"% is baseline null accuracy)"
)
)
# Order locales sensibly
locOrder <- all2016Accuracy %>%
filter(correct) %>%
arrange(pct) %>%
pull(locale)
# Create plot for which locales are classified as each other
all2016Accuracy %>%
mutate(locale=factor(locale, levels=locOrder),
predPretty=factor(str_replace(predicted, pattern=", ", replacement="\n"),
levels=str_replace(locOrder, pattern=", ", replacement="\n")
)
) %>%
ggplot(aes(y=locale, x=predPretty)) +
geom_tile(aes(fill=pct)) +
geom_text(aes(label=paste0(round(100*pct), "%"))) +
scale_fill_continuous("% Predicted As", low="white", high="green") +
scale_x_discrete(position="top") +
theme(axis.text.x=element_text(angle=90)) +
labs(x="",
y="Actual Locale",
title="Predicted Locale vs. Actual Locale",
caption="Temperature, Dew Point, Month as predictors"
)
The model is fairly accurate in predicting Las Vegas and San Diego. The model frequently predicts Houston and New Orleans as one of the group, but often classifying one locale as the other.
Accuracy is much lower for many of the cold wether cities. While there is improvement relative to the null accuracy, there is significant overlap in the temperature and dew point by month.
Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold weather cities on the downwind side of the lake, as well as helping to further pull apart Las Vegas (almost always clear) and San Diego (frequent marine layer).
A helper function is built to map cloud heights to buckets, then minimum cloud height and ceiling are added to the model:
# Convert cloud height to buckets and factor
mapCloudHeight <- function(x) {
factor(case_when(x==-100 ~ "None",
x <= 1000 ~ "Surface",
x <= 3000 ~ "Low",
x <= 6000 ~ "Medium",
x <= 12000 ~ "High",
TRUE ~ "Error"
),
levels=c("Surface", "Low", "Medium", "High", "None")
)
}
# Apply to metarData and keep variables of interest
modData <- metarData %>%
mutate(minHeight=mapCloudHeight(minHeight), ceilingHeight=mapCloudHeight(ceilingHeight))
An updated random forest model is then run:
# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(modData,
vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
locs=names_2016,
ntree=50,
seed=2006201355
)
The evaluation process is converted to a function:
# Evaluate model predictions
evalPredictions <- function(lst,
plotCaption,
keyVar="locale"
) {
# FUNCTION ARGUMENTS:
# lst: the list containing outputs of the modeling
# plotCaption: description of predictors used
# keyVar: the variable that represents truth when assessing predictions
# Create summary of accuracy
all2016Accuracy <- lst$testData %>%
mutate(locale=factor(get(keyVar), levels=levels(predicted))) %>%
count(locale, predicted, correct) %>%
group_by(locale) %>%
mutate(pct=n/sum(n)) %>%
ungroup()
# Calculate the number of levels
nLevels <- length(levels(factor(all2016Accuracy$locale)))
nullAcc <- 1 / nLevels
# Create plot for overall accuracy
p1 <- all2016Accuracy %>%
filter(locale==predicted) %>%
ggplot(aes(x=fct_reorder(locale, pct))) +
geom_point(aes(y=pct), size=2) +
geom_text(aes(y=pct+0.04, label=paste0(round(100*pct), "%"))) +
geom_hline(aes(yintercept=nullAcc), lty=2) +
coord_flip() +
ylim(0, 1) +
labs(x="",
y="Correctly Predicted",
title="Accuracy of Locale Predictions",
subtitle="(positive detection rate by locale)",
caption=paste0(plotCaption,
" as predictors\n(",
round(100*nullAcc),
"% is baseline null accuracy)"
)
)
print(p1)
# Order locales sensibly
locOrder <- all2016Accuracy %>%
filter(correct) %>%
arrange(pct) %>%
pull(locale)
# Create plot for which locales are classified as each other
p2 <- all2016Accuracy %>%
mutate(locale=factor(locale, levels=locOrder),
predPretty=factor(str_replace(predicted, pattern=", ", replacement="\n"),
levels=str_replace(locOrder, pattern=", ", replacement="\n")
)
) %>%
ggplot(aes(y=locale, x=predPretty)) +
geom_tile(aes(fill=pct)) +
geom_text(aes(label=paste0(round(100*pct), "%"))) +
scale_fill_continuous("% Predicted As", low="white", high="green") +
scale_x_discrete(position="top") +
theme(axis.text.x=element_text(angle=90)) +
labs(x="",
y="Actual Locale",
title="Predicted Locale vs. Actual Locale",
caption=paste0(plotCaption, " as predictors")
)
print(p2)
# Return the accuracy object
all2016Accuracy
}
evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 225 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 706 0.266
## 2 Chicago, IL (2016) Detroit, MI (2016) FALSE 197 0.0742
## 3 Chicago, IL (2016) Grand Rapids, MI (2016) FALSE 180 0.0678
## 4 Chicago, IL (2016) Green Bay, WI (2016) FALSE 119 0.0448
## 5 Chicago, IL (2016) Houston, TX (2016) FALSE 56 0.0211
## 6 Chicago, IL (2016) Indianapolis, IN (2016) FALSE 263 0.0991
## 7 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 28 0.0105
## 8 Chicago, IL (2016) Lincoln, NE (2016) FALSE 119 0.0448
## 9 Chicago, IL (2016) Madison, WI (2016) FALSE 168 0.0633
## 10 Chicago, IL (2016) Milwaukee, WI (2016) FALSE 220 0.0829
## # ... with 215 more rows
Accuracy increases, especially for San Diego and several of the cold wether cities.
The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:
# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(modData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=names_2016,
ntree=50,
seed=2006201355
)
The evaluation process can again be run:
evalPredictions(rf_all_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
)
## # A tibble: 225 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 922 0.347
## 2 Chicago, IL (2016) Detroit, MI (2016) FALSE 179 0.0674
## 3 Chicago, IL (2016) Grand Rapids, MI (2016) FALSE 186 0.0701
## 4 Chicago, IL (2016) Green Bay, WI (2016) FALSE 139 0.0524
## 5 Chicago, IL (2016) Houston, TX (2016) FALSE 28 0.0105
## 6 Chicago, IL (2016) Indianapolis, IN (2016) FALSE 186 0.0701
## 7 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 8 0.00301
## 8 Chicago, IL (2016) Lincoln, NE (2016) FALSE 101 0.0380
## 9 Chicago, IL (2016) Madison, WI (2016) FALSE 183 0.0689
## 10 Chicago, IL (2016) Milwaukee, WI (2016) FALSE 285 0.107
## # ... with 215 more rows
Including wind significantly improves model accuracy for most locales. Even the cold weather cities are now being predicted with 33%-50% accurcay.
Next, a smaller subset of the weather data is explored, with the cities grouped as:
Further, a variable will be created for “hr”, the Zulu hour of the observation:
locale2016Mapper <- c('Cold-MI', 'Newark', 'Cold', 'Cold-MI', 'Humid', 'Cold', 'Desert', 'Lincoln', 'Cold', 'Cold', 'Cold', 'Humid', 'Cold', 'Marine', 'Cold-MI')
names(locale2016Mapper) <- c('Detroit, MI (2016)', 'Newark, NJ (2016)', 'Green Bay, WI (2016)', 'Grand Rapids, MI (2016)', 'Houston, TX (2016)', 'Indianapolis, IN (2016)', 'Las Vegas, NV (2016)', 'Lincoln, NE (2016)', 'Milwaukee, WI (2016)', 'Madison, WI (2016)', 'Minneapolis, MN (2016)', 'New Orleans, LA (2016)', 'Chicago, IL (2016)', 'San Diego, CA (2016)', 'Traverse City, MI (2016)')
tibble::tibble(locale=names(locale2016Mapper), mapping=locale2016Mapper) %>%
arrange(mapping, locale)
## # A tibble: 15 x 2
## locale mapping
## <chr> <chr>
## 1 Chicago, IL (2016) Cold
## 2 Green Bay, WI (2016) Cold
## 3 Indianapolis, IN (2016) Cold
## 4 Madison, WI (2016) Cold
## 5 Milwaukee, WI (2016) Cold
## 6 Minneapolis, MN (2016) Cold
## 7 Detroit, MI (2016) Cold-MI
## 8 Grand Rapids, MI (2016) Cold-MI
## 9 Traverse City, MI (2016) Cold-MI
## 10 Las Vegas, NV (2016) Desert
## 11 Houston, TX (2016) Humid
## 12 New Orleans, LA (2016) Humid
## 13 Lincoln, NE (2016) Lincoln
## 14 San Diego, CA (2016) Marine
## 15 Newark, NJ (2016) Newark
modData <- modData %>%
mutate(hr=lubridate::hour(dtime),
locType=locale2016Mapper[locale]
)
modData %>%
count(locType, year) %>%
pivot_wider(locType, names_from="year", values_from="n")
## # A tibble: 8 x 4
## locType `2016` `2015` `2017`
## <chr> <int> <int> <int>
## 1 Cold 52521 NA NA
## 2 Cold-MI 26300 NA NA
## 3 Desert 8770 NA NA
## 4 Humid 17535 NA NA
## 5 Lincoln 8765 NA NA
## 6 Marine 8762 NA NA
## 7 Newark 8773 NA NA
## 8 <NA> NA 34919 34871
The data are the filtered so that there are an equal number of observations from each locale type. The model can later be predicted against the remaining items:
# Set a seed for reporducibility
set.seed(2006211352)
# Find the smallest locale type
nSmall <- modData %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
subData <- modData %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Sumarize the data subset
subData %>%
count(locale) %>%
arrange(-n)
## # A tibble: 15 x 2
## locale n
## <chr> <int>
## 1 Las Vegas, NV (2016) 8762
## 2 Lincoln, NE (2016) 8762
## 3 Newark, NJ (2016) 8762
## 4 San Diego, CA (2016) 8762
## 5 New Orleans, LA (2016) 4401
## 6 Houston, TX (2016) 4361
## 7 Grand Rapids, MI (2016) 2953
## 8 Traverse City, MI (2016) 2951
## 9 Detroit, MI (2016) 2858
## 10 Milwaukee, WI (2016) 1503
## 11 Chicago, IL (2016) 1482
## 12 Minneapolis, MN (2016) 1473
## 13 Madison, WI (2016) 1452
## 14 Green Bay, WI (2016) 1439
## 15 Indianapolis, IN (2016) 1413
The previous model can then be run on the data subset:
# Run random forest for all 2016 locale types
rf_types_2016_TDmcw <- rfMultiLocale(subData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locType",
pred="locType",
ntree=50,
seed=2006211401
)
##
## Running for locations:
## [1] "Cold" "Cold-MI" "Desert" "Humid" "Lincoln" "Marine" "Newark"
The evaluation process can again be run:
evalPredictions(rf_types_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction",
keyVar="locType"
)
## # A tibble: 49 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 1127 0.429
## 2 Cold Cold-MI FALSE 633 0.241
## 3 Cold Desert FALSE 47 0.0179
## 4 Cold Humid FALSE 86 0.0327
## 5 Cold Lincoln FALSE 350 0.133
## 6 Cold Marine FALSE 63 0.0240
## 7 Cold Newark FALSE 322 0.123
## 8 Cold-MI Cold FALSE 546 0.208
## 9 Cold-MI Cold-MI TRUE 1327 0.506
## 10 Cold-MI Desert FALSE 39 0.0149
## # ... with 39 more rows
Predictive ability is strongest for the well-differentiated types, and weakest for the more poorly differentiated types. The model is about 50% accurate in classifying cold climates; 25% of the time they are classified as they other type of cold climate, and 25% of the time they are classified as something else (typically Lincoln or Newark).
The modeling is run again using a much smaller training dataset (testSize set to 0.9) to see the implication of a much smaller data volume:
# Run random forest for all 2016 locale types
rf_types_2016_small_TDmcw <- rfMultiLocale(subData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locType",
pred="locType",
ntree=50,
seed=2006211419,
testSize=0.9
)
##
## Running for locations:
## [1] "Cold" "Cold-MI" "Desert" "Humid" "Lincoln" "Marine" "Newark"
The evaluation process can again be run:
evalPredictions(rf_types_2016_small_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction",
keyVar="locType"
)
## # A tibble: 49 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 2510 0.318
## 2 Cold Cold-MI FALSE 2123 0.269
## 3 Cold Desert FALSE 150 0.0190
## 4 Cold Humid FALSE 322 0.0407
## 5 Cold Lincoln FALSE 1351 0.171
## 6 Cold Marine FALSE 239 0.0302
## 7 Cold Newark FALSE 1210 0.153
## 8 Cold-MI Cold FALSE 1678 0.216
## 9 Cold-MI Cold-MI TRUE 3354 0.431
## 10 Cold-MI Desert FALSE 175 0.0225
## # ... with 39 more rows
As expected, predictions are less accurate with a smaller training dataset. Overall acuuracy falls from ~70% to ~60% with the much smaller training data volumes. This suggests the model is continuing to learn from the data, and may benefit from an expanded sample.
Next, the model is adapted to include the hour (as an integer, which may need to be rethought, though this will generally capture whether it is daytime or nighttime even without factoring) and the sea-level pressure variable:
# Run random forest for all 2016 locale types - add hour and modSLP, limit mtry to 4
rf_types_2016_TDmcwha <- rfMultiLocale(subData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="locType",
pred="locType",
ntree=50,
seed=2006211423,
mtry=4
)
##
## Running for locations:
## [1] "Cold" "Cold-MI" "Desert" "Humid" "Lincoln" "Marine" "Newark"
The evaluation process can again be run:
evalPredictions(rf_types_2016_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="locType"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 49 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 1265 0.480
## 2 Cold Cold-MI FALSE 615 0.233
## 3 Cold Desert FALSE 18 0.00683
## 4 Cold Humid FALSE 59 0.0224
## 5 Cold Lincoln FALSE 393 0.149
## 6 Cold Marine FALSE 53 0.0201
## 7 Cold Newark FALSE 233 0.0884
## 8 Cold-MI Cold FALSE 496 0.194
## 9 Cold-MI Cold-MI TRUE 1521 0.596
## 10 Cold-MI Desert FALSE 22 0.00861
## # ... with 39 more rows
Accuracy increases to 75%. In particular, the model is better able to distinguish Lincoln and Newark from the remaining locations, which has a follow-on effect of improving cold weather classifications.
Exploration is then run on distinguishing two locales from each other, to see the implications of parameters like data volume (controlled through testSize) and the number of trees.
Comparisons will be run as follows:
ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)
desertHumid <- vector("list", length(ntrees) * length(testSizes))
n <- 1
for (ntree in ntrees) {
for (testSize in testSizes) {
desertHumid[[n]] <- rfTwoLocales(subData,
loc1="Desert",
loc2="Humid",
locVar="locType",
vrbls=c("TempF", "DewF", "month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir", "modSLP"
),
pred="locType",
seed=2006211432,
ntree=ntree,
mtry=4,
testSize=testSize
)
desertHumid[[n]]$ntree <- ntree
desertHumid[[n]]$testSize <- testSize
n <- n+1
}
}
An assessment of accuracy can then be performed:
df <- sapply(desertHumid,
FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
) %>%
t() %>%
tibble::as_tibble()
df %>%
mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
ggplot(aes(x=ntree, y=trainSize)) +
geom_text(aes(label=paste0(round(100*accuracy), "%"))) +
scale_x_log10()
As expected, accuracy is very high for distinguishing Desert and Humid climates, even with small data volumes (training size 10% of the data) and a small number of trees.
The approach is then run for the cold weather cities:
ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)
coldColdMI <- vector("list", length(ntrees) * length(testSizes))
n <- 1
for (ntree in ntrees) {
for (testSize in testSizes) {
coldColdMI[[n]] <- rfTwoLocales(subData,
loc1="Cold",
loc2="Cold-MI",
locVar="locType",
vrbls=c("TempF", "DewF", "month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir", "modSLP"
),
pred="locType",
seed=2006211501,
ntree=ntree,
mtry=4,
testSize=testSize
)
coldColdMI[[n]]$ntree <- ntree
coldColdMI[[n]]$testSize <- testSize
n <- n+1
}
}
An assessment of accuracy can then be performed:
dfColdColdMI <- sapply(coldColdMI,
FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
) %>%
t() %>%
tibble::as_tibble()
dfColdColdMI %>%
mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
ggplot(aes(x=ntree, y=trainSize)) +
geom_text(aes(label=paste0(round(100*accuracy), "%"))) +
scale_x_log10()
Greater data volumes and greater tree depths are more helpful for distinguishing cold weather city types from each other. The smallest forest has a 56% accuracy (null 50%) while the largest has a 68% accuracy. This suggests some potential upside to continuing the cold weather modeling process with more cities added for greater data volumes.
Data from years other than 2016 can be fed in to the model, with predictions made to see whether the model is learning general differences in climate or specific features about 2016.
Data that were not included in the modeling can be predicted, with the results assessed:
Example code includes:
nonUsedData <- modData %>%
anti_join(select(subData, source, dtime))
## Joining, by = c("source", "dtime")
nonUsedData %>%
mutate(localeName=str_replace(locale, pattern=" .{1}\\d{4}.*", replacement="")) %>%
count(localeName, year) %>%
pivot_wider(localeName, names_from="year", values_from="n")
## # A tibble: 15 x 4
## localeName `2015` `2016` `2017`
## <chr> <int> <int> <int>
## 1 Chicago, IL 8728 7285 8740
## 2 Detroit, MI NA 5912 NA
## 3 Grand Rapids, MI NA 5811 NA
## 4 Green Bay, WI NA 7316 NA
## 5 Houston, TX NA 4407 NA
## 6 Indianapolis, IN NA 7307 NA
## 7 Las Vegas, NV 8727 8 8664
## 8 Lincoln, NE NA 3 NA
## 9 Madison, WI NA 7298 NA
## 10 Milwaukee, WI NA 7257 NA
## 11 Minneapolis, MN NA 7296 NA
## 12 New Orleans, LA 8727 4366 8723
## 13 Newark, NJ NA 11 NA
## 14 San Diego, CA 8737 NA 8744
## 15 Traverse City, MI NA 5815 NA
A function is written to make predictions and plot the results:
# Function to predict model on to df, then plot the outcomes
helperPredictPlot <- function(model,
df,
vrbls=NULL,
origVar="locale",
limitObs=500,
predOrder=NULL,
locMapper=NULL
) {
# FUNCTION ARGUMENTS:
# model: a trained model
# df: a data frame or tibble for the model to be predicted on to
# vrbls: the variables needed by model (records with NA will be deleted)
# origVar: the original (truth) variable from df
# limitObs: the minimum number of observations for a locale to be plotted
# predOrder: ordering of factor variable "predicted" (NULL means leave as-is)
# locMapper: mapping file for locations (will order locale to match predicted if not NULL)
# Get variable names if NULL, assuming random forest
if (is.null(vrbls)) {
vrbls <- model$importance %>% rownames()
}
# Filter df to have only complete cases for vrbls
dfPred <- df %>%
filter_at(vars(all_of(vrbls)), all_vars(!is.na(.)))
# Make predictions
dfPred <- dfPred %>%
mutate(predicted=predict(model, newdata=select_at(dfPred, vars(all_of(vrbls)))))
# Summarize predictions
dfAccuracy <- dfPred %>%
group_by_at(vars(all_of(c(origVar, "predicted")))) %>%
summarize(n=n()) %>%
mutate(pct=n/sum(n)) %>%
ungroup()
# Filter dfAccuracy to only origVar with count of at least limitObs
hasEnough <- dfAccuracy %>%
group_by_at(vars(all_of(origVar))) %>%
summarize(n=sum(n)) %>%
filter(n > limitObs) %>%
pull(origVar)
# Create the proper order for the factor variable 'predicted' if requested
if (!is.null(predOrder)) {
dfAccuracy <- dfAccuracy %>%
mutate(predicted=factor(predicted, levels=predOrder))
}
# Create the base locale mapping
locales <- dfAccuracy %>%
pull(origVar) %>%
unique()
# Modify the locale mapping if a mapper is provided
if (!is.null(locMapper) & !is.null(predOrder)) {
localeStrip <- str_replace(locales, pattern=" .\\d{4}.", replacement="")
locales <- locales[order(match(locMapper[localeStrip], predOrder))]
}
# Create plot for which locales are classified as each other
p1 <- dfAccuracy %>%
filter_at(vars(all_of(origVar)), all_vars(. %in% hasEnough)) %>%
mutate(locale=factor(get(origVar), levels=locales)) %>%
ggplot(aes(y=locale, x=predicted)) +
geom_tile(aes(fill=pct)) +
geom_text(aes(label=paste0(round(100*pct), "%"))) +
scale_fill_continuous("% Predicted As", low="white", high="green") +
scale_x_discrete(position="top") +
theme(axis.text.x=element_text(angle=90)) +
labs(x="",
y="Actual Locale",
title="Predicted Locale vs. Actual Locale"
)
print(p1)
# Return the predictions summary
dfAccuracy
}
The function can then be applied to the 2016 data:
# Modify the file for mapping locale to type
localeMapper <- locale2016Mapper
names(localeMapper) <- str_replace(names(localeMapper), pattern=" .\\d{4}.", replacement="")
# Predictions on 2016 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel,
df=filter(nonUsedData, year==2016),
predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"),
locMapper=localeMapper
)
## # A tibble: 83 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Chicago, IL (2016) Cold 3608 0.496
## 2 Chicago, IL (2016) Cold-MI 1792 0.246
## 3 Chicago, IL (2016) Desert 60 0.00825
## 4 Chicago, IL (2016) Humid 165 0.0227
## 5 Chicago, IL (2016) Lincoln 766 0.105
## 6 Chicago, IL (2016) Marine 157 0.0216
## 7 Chicago, IL (2016) Newark 728 0.100
## 8 Detroit, MI (2016) Cold 1111 0.189
## 9 Detroit, MI (2016) Cold-MI 3290 0.560
## 10 Detroit, MI (2016) Desert 66 0.0112
## # ... with 73 more rows
The predictions seem in-line with the test dataset conclusions, as would be expected given that the “unused” 2016 data are randomly sampled and should be no different than that “test” 2016 dataset.
The function can then be applied to the 2015 and 2017 data:
# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel,
df=filter(nonUsedData, year!=2016),
predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"),
locMapper=localeMapper
)
## # A tibble: 56 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Chicago, IL (2015) Cold 3043 0.349
## 2 Chicago, IL (2015) Cold-MI 2269 0.260
## 3 Chicago, IL (2015) Desert 137 0.0157
## 4 Chicago, IL (2015) Humid 257 0.0295
## 5 Chicago, IL (2015) Lincoln 1373 0.157
## 6 Chicago, IL (2015) Marine 223 0.0256
## 7 Chicago, IL (2015) Newark 1418 0.163
## 8 Chicago, IL (2017) Cold 2537 0.291
## 9 Chicago, IL (2017) Cold-MI 2299 0.263
## 10 Chicago, IL (2017) Desert 227 0.0260
## # ... with 46 more rows
Model performance on 2015 and 2017 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the three well-differentiated types (Desert, Humid, Marine).
However, the model struggles to classify Chicago, placing it roughly equally as Cold (correct), Cold-MI, Lincoln, and Newark. This suggests cold-weather cities may be predicted on specific anomalies of 2016 (e.g., where the cold snaps hit). More data may help the model generalize better, specifcally to separate recurring and generalizable climate features of Chicago from weather anomalies that happened to hit Chicago in 2016.
Suppose that models are run on all 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:
# Create the subset for Chicago, Las Vegas, New Orleans, San Diego (should have 2015, 2016, 2017)
sub_2015_2017_data <- modData %>%
filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")) %>%
mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""))
# Check that proper locales are included
sub_2015_2017_data %>%
count(city, locale)
## # A tibble: 12 x 3
## city locale n
## <chr> <chr> <int>
## 1 Chicago, IL Chicago, IL (2015) 8728
## 2 Chicago, IL Chicago, IL (2016) 8767
## 3 Chicago, IL Chicago, IL (2017) 8740
## 4 Las Vegas, NV Las Vegas, NV (2015) 8727
## 5 Las Vegas, NV Las Vegas, NV (2016) 8770
## 6 Las Vegas, NV Las Vegas, NV (2017) 8664
## 7 New Orleans, LA New Orleans, LA (2015) 8727
## 8 New Orleans, LA New Orleans, LA (2016) 8767
## 9 New Orleans, LA New Orleans, LA (2017) 8723
## 10 San Diego, CA San Diego, CA (2015) 8737
## 11 San Diego, CA San Diego, CA (2016) 8762
## 12 San Diego, CA San Diego, CA (2017) 8744
# Run random forest for 2015-2017 data
rf_types_2015_2017_TDmcwha <- rfMultiLocale(sub_2015_2017_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="city",
pred="city",
ntree=25,
seed=2006221334,
mtry=4
)
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2017_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="city"
)
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL Chicago, IL TRUE 7473 0.941
## 2 Chicago, IL Las Vegas, NV FALSE 110 0.0139
## 3 Chicago, IL New Orleans, LA FALSE 162 0.0204
## 4 Chicago, IL San Diego, CA FALSE 196 0.0247
## 5 Las Vegas, NV Chicago, IL FALSE 122 0.0158
## 6 Las Vegas, NV Las Vegas, NV TRUE 7434 0.960
## 7 Las Vegas, NV New Orleans, LA FALSE 49 0.00633
## 8 Las Vegas, NV San Diego, CA FALSE 141 0.0182
## 9 New Orleans, LA Chicago, IL FALSE 158 0.0201
## 10 New Orleans, LA Las Vegas, NV FALSE 70 0.00891
## 11 New Orleans, LA New Orleans, LA TRUE 7356 0.937
## 12 New Orleans, LA San Diego, CA FALSE 268 0.0341
## 13 San Diego, CA Chicago, IL FALSE 132 0.0168
## 14 San Diego, CA Las Vegas, NV FALSE 188 0.0239
## 15 San Diego, CA New Orleans, LA FALSE 152 0.0194
## 16 San Diego, CA San Diego, CA TRUE 7380 0.940
Even with a very small forest (25 trees), the model is almost always separating Las Vegas, Chicago, San Diego, and New Orleans. While the climates are very different in these cities, it is striking that the model has so few misclassifications.
How do other cities map against these classifications?
# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2015_2017_TDmcwha$rfModel,
df=filter(modData, !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))),
predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
)
## # A tibble: 44 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Detroit, MI (2016) Chicago, IL 7614 0.873
## 2 Detroit, MI (2016) Las Vegas, NV 262 0.0300
## 3 Detroit, MI (2016) New Orleans, LA 553 0.0634
## 4 Detroit, MI (2016) San Diego, CA 292 0.0335
## 5 Grand Rapids, MI (2016) Chicago, IL 7927 0.918
## 6 Grand Rapids, MI (2016) Las Vegas, NV 133 0.0154
## 7 Grand Rapids, MI (2016) New Orleans, LA 283 0.0328
## 8 Grand Rapids, MI (2016) San Diego, CA 288 0.0334
## 9 Green Bay, WI (2016) Chicago, IL 7958 0.915
## 10 Green Bay, WI (2016) Las Vegas, NV 85 0.00978
## # ... with 34 more rows
The prediction process is re-run excluding the altimeter (modSLP) data:
# Run random forest for 2015-2017 data (exclude modSLP)
rf_types_2015_2017_TDmcwh <- rfMultiLocale(sub_2015_2017_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="city",
pred="city",
ntree=25,
seed=2006221334,
mtry=4
)
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2017_TDmcwh,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="city"
)
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL Chicago, IL TRUE 7296 0.919
## 2 Chicago, IL Las Vegas, NV FALSE 140 0.0176
## 3 Chicago, IL New Orleans, LA FALSE 263 0.0331
## 4 Chicago, IL San Diego, CA FALSE 242 0.0305
## 5 Las Vegas, NV Chicago, IL FALSE 186 0.0240
## 6 Las Vegas, NV Las Vegas, NV TRUE 7317 0.945
## 7 Las Vegas, NV New Orleans, LA FALSE 95 0.0123
## 8 Las Vegas, NV San Diego, CA FALSE 148 0.0191
## 9 New Orleans, LA Chicago, IL FALSE 248 0.0316
## 10 New Orleans, LA Las Vegas, NV FALSE 123 0.0157
## 11 New Orleans, LA New Orleans, LA TRUE 7137 0.909
## 12 New Orleans, LA San Diego, CA FALSE 344 0.0438
## 13 San Diego, CA Chicago, IL FALSE 175 0.0223
## 14 San Diego, CA Las Vegas, NV FALSE 231 0.0294
## 15 San Diego, CA New Orleans, LA FALSE 202 0.0257
## 16 San Diego, CA San Diego, CA TRUE 7244 0.923
Performance drops, since altimeters are meaningfully different in high altitude (Las Vegas) and sea-level (New Orleans and San Diego) locations.
Variable importances are plotted:
helperPlotVarImp <- function(model, titleAdd="", mapper=varMapper) {
p1 <- model %>%
caret::varImp() %>%
rownames_to_column("predictor") %>%
ggplot(aes(x=fct_reorder(paste0(predictor, "\n", varMapper[predictor]), Overall), y=Overall)) +
geom_col(fill="lightblue") +
labs(x="", y="", title=paste0("Variable Importance", titleAdd)) +
coord_flip()
print(p1)
}
helperPlotVarImp(rf_types_2015_2017_TDmcwha$rfModel)
helperPlotVarImp(rf_types_2015_2017_TDmcwh$rfModel, titleAdd=" (Excludes SLP)")
Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
Prediction certainties can also be assessed:
assessPredictionCertainty <- function(lst,
plotCaption,
keyVar="locale",
testData=lst[["testData"]],
listModel="rfModel",
pkVars=c("dtime"),
thresh=0.8,
showHists=FALSE,
showAcc=FALSE
) {
# FUNCTION ARGUMENTS:
# lst: the list containing outputs of the modeling
# plotCaption: description of predictors used
# keyVar: the variable that represents truth when assessing predictions
# testData: the test dataset (default is the 'testData element of list lst)
# listModel: the named element of lst containing the random forest model
# pkVars: the variable(s) of listData that, with keyVar, function as a primary key (unique identifier)
# thresh: threshhold for voting confidence
# showHists: boolean, whether to create histograms for confidence
# showAcc: boolean, for whether to show accuracy by percent of votes (only makes sense when testData and listModel have the same levels for keyVar)
# Get variable names for the random forest
vrbls <- lst[[listModel]]$importance %>% rownames()
# Filter testData to have only complete cases for vrbls
dfPred <- testData %>%
filter_at(vars(all_of(vrbls)), all_vars(!is.na(.))) %>%
select_at(vars(all_of(c(keyVar, pkVars, vrbls))))
# Create voting summary for every element in testData
predSummary <- predict(lst[[listModel]], newdata=dfPred, type="prob") %>%
as.data.frame() %>%
tibble::as_tibble() %>%
bind_cols(select_at(dfPred, vars(all_of(c(keyVar, pkVars))))) %>%
mutate(finalPrediction=predict(lst[[listModel]], newdata=dfPred)) %>%
pivot_longer(cols=-c(keyVar, pkVars, "finalPrediction"),
names_to="prediction",
values_to="pctVotes"
)
# Create summary of maximum prediction
allPredictions <- predSummary %>%
group_by_at(vars(all_of(c(keyVar, pkVars)))) %>%
mutate(maxPct=max(pctVotes), topProb=(pctVotes==maxPct)) %>%
ungroup()
# Create and show histograms if requested
if (showHists) {
# Create plot for confidence level by final prediction
p1 <- allPredictions %>%
filter(finalPrediction==prediction) %>%
ggplot(aes(x=pctVotes, y=..count../sum(..count..))) +
geom_histogram() +
labs(title="Percent of Votes Received When Truth Is:",
y="Frequency of Result",
x="Percent of Votes Received"
) +
facet_wrap(~get(keyVar))
print(p1)
# Create plot for confidence level by final prediction
p2 <- allPredictions %>%
filter(finalPrediction==prediction) %>%
ggplot(aes(x=pctVotes, y=..count../sum(..count..))) +
geom_histogram() +
labs(title="Percent of Votes Received When Final Prediction Is:",
y="Frequency of Result",
x="Percent of Votes Received"
) +
facet_wrap(~finalPrediction)
print(p2)
}
# Create plot of total votes by type
p3 <- allPredictions %>%
group_by_at(vars(all_of(c(keyVar, "prediction")))) %>%
summarize(meanVotes=mean(pctVotes)) %>%
ggplot(aes_string(x="prediction", y=keyVar)) +
geom_tile(aes(fill=meanVotes)) +
geom_text(aes(label=paste0(round(100*meanVotes), "%"))) +
scale_fill_continuous("% Predicted As", low="white", high="green", limits=c(0, 1)) +
scale_x_discrete(position="top") +
# theme(axis.text.x=element_text(angle=90)) +
labs(x="",
y="Actual Locale",
title="Predicted Locale vs. Actual Locale",
subtitle="Average votes received by prediction",
caption=paste0(plotCaption, " as predictors")
)
print(p3)
# Create plot of accuracy when threshhold is met
if (showAcc) {
# Percent accuracy when threshold is met
p4 <- allPredictions %>%
filter(finalPrediction==prediction) %>%
mutate(correct=(get(keyVar)==finalPrediction),
meetThresh=factor(ifelse(pctVotes >= thresh, "Threshold Met", "Threshold Not Met"),
levels=c("Threshold Not Met", "Threshold Met")
)
) %>%
group_by_at(c(keyVar, "meetThresh")) %>%
summarize(pctCorrect=mean(correct)) %>%
ggplot(aes_string(x=keyVar)) +
geom_point(aes(y=pctCorrect)) +
geom_text(aes(y=pctCorrect+0.08, label=paste0(round(100*pctCorrect), "%"))) +
coord_flip() +
ylim(c(0, 1.08)) +
facet_wrap(~meetThresh) +
labs(x="Truth",
y="Percent Correctly Classified",
title="Accuracy of Classifying",
subtitle=paste0("Threshold Met means percent of votes is >= ", round(100*thresh), "%"),
caption=paste0(plotCaption, " as predictors")
)
print(p4)
}
# Create plot of total votes by threshold
p6 <- allPredictions %>%
filter(finalPrediction==prediction) %>%
mutate(modPredict=ifelse(pctVotes>=thresh, prediction, "Too Low")) %>%
group_by_at(vars(all_of(c(keyVar, "modPredict")))) %>%
summarize(n=n()) %>%
mutate(pct=n/sum(n)) %>%
ggplot(aes_string(x="modPredict", y=keyVar)) +
geom_tile(aes(fill=pct)) +
geom_text(aes(label=paste0(round(100*pct), "%"))) +
scale_fill_continuous("% Predicted As", low="white", high="lightblue", limits=c(0, 1)) +
scale_x_discrete(position="top") +
# theme(axis.text.x=element_text(angle=90)) +
labs(x="",
y="Actual Locale",
title="Predicted Locale vs. Actual Locale",
subtitle=paste0("Predictions with <", round(100*thresh), "% of votes classified as 'Too Low'"),
caption=paste0(plotCaption, " as predictors")
)
print(p6)
# Return the accuracy object
allPredictions
}
The assessment can be run for the two main 2015-2017 models:
# Run for the full model including SLP
probs_2015_2017_TDmcwha <-
assessPredictionCertainty(rf_types_2015_2017_TDmcwha,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP",
showAcc=TRUE
)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# Run for the model excluding SLP
probs_2015_2017_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2017_TDmcwh,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind",
showAcc=TRUE
)
Predictions with 80%+ of the votes are made ~80% of the time, and these predictions are ~99% accurate. Predictions with <80% of the votes are made ~20% of the times, and these predictions are ~75% accurate. The percentage of votes received appears to be a reasonable proxy for the confidence of the prediction.
A similar process can be run for assessing the classification of the other cities against the 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:
useData <- modData %>%
filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")))
# Run for the model excluding SLP
probs_allcities_2015_2017_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2017_TDmcwh,
testData=useData,
keyVar="locale",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind",
showHists=TRUE
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Patterns in the predictions stand out: